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Abstract 

Background: Large-scale chromosomal deletions or other non-specific perturbations of the transcriptome can alter 
the expression of hundreds or thousands of genes, and it is of biological interest to understand which genes are most 
profoundly affected. We present a method for predicting a gene's expression as a function of other genes thereby 
accounting for the effect of transcriptional regulation that confounds the identification of genes differentially 
expressed relative to a regulatory network. The challenge in constructing such models is that the number of possible 
regulator transcripts within a global network is on the order of thousands, and the number of biological samples is 
typically on the order of 1 0. Nevertheless, there are large gene expression databases that can be used to construct 
networks that could be helpful in modeling transcriptional regulation in smaller experiments. 

Results: We demonstrate a type of penalized regression model that can be estimated from large gene expression 
databases, and then applied to smaller experiments. The ridge parameter is selected by minimizing the 
cross-validation error of the predictions in the independent out-sample. This tends to increase the model stability and 
leads to a much greater degree of parameter shrinkage, but the resulting biased estimation is mitigated by a second 
round of regression. Nevertheless, the proposed computationally efficient "over-shrinkage" method outperforms 
previously used LASSO-based techniques. In two independent datasets, we find that the median proportion of 
explained variability in expression is approximately 25%, and this results in a substantial increase in the signal-to-noise 
ratio allowing more powerful inferences on differential gene expression leading to biologically intuitive findings. We 
also show that a large proportion of gene dependencies are conditional on the biological state, which would be 
impossible with standard differential expression methods. 

Conclusions: By adjusting for the effects of the global network on individual genes, both the sensitivity and reliability 
of differential expression measures are greatly improved. 



Background 

A goal of systems biology is to understand how a per- 
turbation affects a network of interrelationships between 
genes. There are many models for gene networks, but few 
have shown accurate predictions across many datasets [1]. 
There is a growing collection of gene perturbation exper- 
iments in which a subset of transcipts' expression have 
been modulated, either through carcinogenesis, chemi- 
cal treatments, or through siRNA [2]. These experiments 
comprise an information-rich dataset that allows us to 



"Correspondence: gelfondjal@uthscsa.edu 

1 Department of Epidemiology and Biostatistics, University of Texas Health 

Science Center San Antonio, San Antonio, Texas, USA 

Full list of author information is available at the end of the article 



construct global network models and test their predictive 
accuracy on the level of the full transcriptome. 

Network models can be helpful for distinguishing 
between all differentially expressed genes and the genes 
that are immediately affected by a perturbation such as a 
gene deletion. There are clinically relevant reasons for this 
line of investigation. For instance, chromosomal abnor- 
malities are the most common cause of mental retardation 
in the US, and the deletion of a chromosomal segment is a 
common subtype [3]. The deleted segments contain many 
genes, so that this is a type of sporadic, multiple gene 
knockout. The deletions are often hemizygous; that is, only 
one of the two (maternal or paternal) homologous chro- 
mosomes with the segment remains intact. Hence, the 
genes within the deleted segments are present with half of 
the normal copy number. Most genes exhibit some form of 
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dosage-sensitive decrease in expression (e.g., DNA dele- 
tion tends to decrease mRNA expression), but there is a 
possibility that the genes regulatory network can compen- 
sate for the loss by selectively increasing the expression 
of the intact copy. The ability or inability of the regula- 
tory networks to compensate with the intact copies on 
the homologous chromosome could explain which genes 
are responsible for common abnormal phenotypes. In this 
paper, we examined subjects with nearly identical somatic 
deleted segments of the q-arm of chromosome 18, and 
we performed comparative gene expression microarrays 
of these subjects with a parental control sample. Our 
goal is to: 1) identify the global gene expression differ- 
ences, 2) to identify which genes are the most affected by 
the global expression differences relative to their regula- 
tory networks, and 3) to identify genes that have different 
regulatory networks in case and control groups. 

Various methods have been developed both for pre- 
dicting gene expression and adjusting for the correlations 
between genes in differential expression analyses. Dahl 
et al [4] improved the detection of differential expres- 
sion by grouping genes into co-expression clusters with a 
Dirichlet process mixture model. Leek at al [5] proposed 
a general framework for significance testing when a large- 
scale number of components have mutual dependency. 
This method may be applied to the differential expression 
problem and allows independent testing of hypotheses by 
conditioning on an orthogonal dependence kernel. These 
methods have shortcomings because they do not explicitly 
utilize the vast amount of data from prior experiments. 

Some network methods do use extensive prior data 
such as Ruan et al [6] who used the cluster average of 
a set of co-expressed genes as a prediction for another 
gene within the cluster. This relatively simple method had 
similar or superior accuracy to models that used sub- 
stantial auxiliary data including regulatory pathways and 
DNA-binding patterns of transcription factors such as 
the approach taken by Gustafsson et al [7] that used an 
"elastic-net" penalty [8] and information from multiple 
genomic modalities (microarray, ChlP-seq, etc.). Along 
the lines of Ruan et al [6], we present a simple and compu- 
tationally efficient method for predicting gene expression. 
Our method extends the work of Congrove et al [9] who 
modeled a genes expression as linearly dependent on 
other genes, but we have added features that improve 
computational efficiency and robustness to variations in 
transcriptional networks. 

Methods 

Suppose we have a series of microarrays represented by 
a matrix Y that is G x N where G is the number of the 
genes on the N arrays, and the expression measurements 
for each gene g will be the row vector y g , g e [ 1 . . . G] . We 
define the perturbation vector X as an N x 1 vector with 



elements X[ = l(i array is perturbed) where /(•) is the 
indicator function. Our goal is to estimate the effect of the 
perturbation conditional on the gene expression network 
P(y g \X, Y. g ) where Y_ g = {y g , : g' £ g,g' e [ 1 . . . G] }. 
That is, we want to predict the effects of the perturbation 
on a gene given that it is within a regulatory network con- 
sisting of all other genes. Typically, the sample size (N) 
of the experimental data Y is small < 10, but we assume 
that a database of experiments Yd (G x Nd) is also avail- 
able with a substantially larger sample size Nd > 100. 
Our method will use this database Yd as an independent 
training set to estimate the regulatory network within Y. 

We make the simplifying assumption that the log- 
transformed gene expression y g follows an approximately 
Gaussian distribution with mean /3 g Y. A predictive model 
for each gene within Y and Yd can be constructed by using 
Sparse Simultaneous Equations Model (SSEM) [9,10]. 
Under this model, the gene expression of a transcript y g 
can be estimated by a weighted sum of the remaining tran- 
scripts. The key parameters are within /3 g (1 x G) where 
the figgf elements are the influence that gene g f has on gene 
g and represent the overall transcription network with 
figg = 0 so that genes do not influence themselves. These 
Pg parameters can be estimated from a database of related 
array experiments. Thus, we have the model 

y g = P g Y + €g (1) 

where e g is an 1 x N vector of Gaussian errors. This may 
be expressed in matrix form as 

Y = BY + E (2) 

where B is G x G and composed of the figg parameters 
such that B has diagonal elements consisting of zeros, and 
E is the GxN matrix of Gaussian errors with rows e g . The 
intrinsic gene network can be described by B. Cosgrove 
et al [9] model the effect of the perturbation of this 
network by introducing (p g 

y g = /3 g Y + (f> g + e g , (3) 

and in matrix form 

Y = BY + O + E. (4) 

The O parameter is the GxN matrix composed of 
(j) g and is the direct effect of the perturbation X on gene 
expression that is not accounted for by the gene network. 
Their goal is to estimate these direct actions O upon gene 
expression to elucidate, for example, the direct drug tar- 
gets. They estimated the B matrix by using compendia or a 
large (N > 100) database of experiments Yd, and estimate 
O using the residuals 

r = yP ert - ByP ert = ^ ert + e pert . (5) 

Cosgrove et al [9] used <p pert as the estimator of the direct 
action of the perturbation. Further, they did not take into 
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account that the network B may not be the same in differ- 
ent cell types or biological conditions, which we consider 
in our approach below. 

Although it was originally applied to large genomic 
databases, this model can be extended to apply to situa- 
tions where the current experiment has insufficient data 
in order to accurately estimate the B parameter. Further- 
more, estimating B with a large database Yd and applying 
it to a new dataset will avoid the problem of using the 
data twice. For instance, if we have a large database of 
expression measures Yd, we may estimate Bd such that 
Yd = BdYd+E, under the assumption that Bd ^ B. Given 
our estimate of Bd with rows fo^, we may then construct 
a linear predictor of each gene g in the new dataset Y by 
= $DgY. 

With a covariate matrix X that applies to the individual 
samples, we propose the model 

y g = pDgYvg + X'8 g + 6 = Y Ng v g + X'8 g + 6. (6) 

This equation differs from Equation 3 as $DgYvg = YNgVg 
is substituted for fi g Y and X'8 g is substituted for (j) g . The 
additional v g scalar parameter represents a very impor- 
tant increase in model flexibility compared to Cosgrove 
et al [9] where v g is constrained to be 1.0. In the new 
approach, v g is estimated for each gene so that if the his- 
torical network is inconsistent with the current network 
and is not a good predictor for gene expression in the cur- 
rent experiment, then its effect on inference approaches 
zero as v g does. The effect of the biological condition 
on gene g is described by the scalar 8 g) and this is dis- 
tinct from the usual formulation of differential expression 
that tests the effect of X (N x 1) alone on y g [11,12]. 
The 8g parameter corresponds to an element of the 
parameter, but unlike <p g in Equation 3, 8 g is jointly esti- 
mated with v g and undergoes formal statistical testing 
in our approach rather than acting merely as a basis for 
ranking candidates. Hence, the hypothesis H$ : 8 g = 0 
is the test of differential expression, conditional on the 
value of the historical gene network. The p-value can be 
derived through standard linear model theory, and the 
status of differential expression can be determined by set- 
ting a cutoff for p- values or false-discovery rates. The 
use of the database Yd to construct Y^ g (1 x iV) is key 
because it allows the network effect within Y to be esti- 
mated with only one additional unknown parameter v g , 
which is essential for datasets with relatively few biological 
replicates. It is noteworthy that the differential expression 
hypothesis using Equation 6 is an increase (or decrease) 
in a genes expression conditional on the gene-network, 
which is a somewhat different hypothesis than standard 
differential expression. Additionally, we further extend the 
previous model of [9] by considering interactions between 



the network prediction and the covariate matrix by defin- 
ing the 1 x N element- wise multiplicative interaction term 
(Y^g x X) within 

Jg = y KgVg + X'8 g + (Y Ng x X)y g + 6. (7) 

The tests of the scalar y g = 0 correspond to the test of dis- 
ruption of the gene network associated with the covariate 
X. By modeling the disruption of the network, we may test 
biologically interesting hypotheses and further improve 
the accuracy of predicting gene expression. 

Even with large gene expression databases, regulariza- 
tion is required for accurate parameter estimation due to 
the condition N G because of the large number of 
parameters in the model for each gene. The prior model 
[9] used the Lasso [13] that has anli penalty and produces 
a parsimonious model with many regression coefficients 
close to 0. We observed that the Lasso produced unsta- 
ble regression coefficients with split sample training and 
validation sets, but prediction accuracy was nevertheless 
maintained. In a parsimonious model, some transcripts 
from a set of correlated transcript predictors would have 
non-zero coefficients whereas the other transcripts in the 
set would have coefficients close to zero. In this manner, 
the predictive weight would be unevenly shared by the 
correlated transcripts. However, Ruan et al [6] obtained 
good success in predicting gene expression by using a 
k-nearest-neighbor (KNN) method. The KNN method 
used the average of co-expressed genes as a predictor, 
which gives equal predictive weight to a set of correlated 
transcripts. We suggest a balance between unequal, par- 
simonious (Lasso) and equal, robust (KNN) weight among 
correlated transcript predictors. One such compromise 
is the elastic net that combines penalties on the L\ and 
L2 norms. The elastic net, however, carries a substantial 
burden in computational cost and risk of overfitting by 
optimizing over two parameters. 

We offer a more efficient compromise and call our 
proposed approach Over-Shrinkage Ridge Regression 
(OSRR). The OSRR model is fit using standard ridge 
regression (RR) and ordinary least squares (OLS) imple- 
mentations. The gene expression values from the database 
Yd and the new data of interest Y are all centered at 
0, which is numerically convenient. The gene expression 
variances are not scaled to all have equal variance because 
the scale of variability is biologically relevant. We are 
applying the network estimates from one array platform 
to another, which are on the log-scale, and we assume 
that fold-changes (i.e., a doubling of gene-expression) will 
be meaningful across platforms. After centering, the first 
step of the model fit is the estimation pDg with penalty 
X based upon the database Yd. In this approach, the 
regularization penalty takes the same form as standard 
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ridge regression, and the first objective function to be 
minimized for each gene is 

\\y Dg - PogYof + m3 Dg f (8) 

where ||-|| 2 is the Euclidean norm. The fog are the ridge 
regression coefficients with yr> g as the outcome and Y' D _ g 
(Yr>- g = Yd without row g) as the covariate matrix. The 
identifiability of fog when X > 0 follows from the the- 
ory of ridge regression [14]. As X — >► oo, the regression 
parameter estimates fo g will tend towards 0 as the penalty 
becomes the dominant term. Hence, the larger the X, the 
more the fo g Y^ is biased towards the null compared to 
the observed database values yr>g. However, we introduce 
a second step and add another parameter v g to counter- 
act the bias towards the null in predicting the data from 
a small experiment Y with rows y g . The second step is 
the estimation of v g conditional on the fog that mini- 
mizes \\y g — fogYvg\\ 2 using ordinary least squares. If we 
let Yjsfg = fogY, then the estimation of v g can be seen 
as the univariate OLS coefficient of Y regressed onto Y^g. 
That is, the second objective function operates on the new 
data Y, and the estimate of row y g is fo g Yv g where v g 
minimizes 

\\y g -Y Ng v g || 2 (9) 

using ordinary least squares. The introduction of v g within 
a second stage of the regression has two effects. First, 
pOSRR _ p DgVg i s no longer biased towards the null as 
X oo. The larger X might make fogY smaller than 
y g by an order of magnitude, which is the motivation 
for the term "Over-Shrinkage". Second, larger values of 
X will act to equalize the elements of /3 g corresponding 
to correlated transcripts. This equalization property pro- 
duces a similar effect as KNN's equal weighting, but OSRR 
still allows data-driven deviation from equality and allows 
genes that are not the nearest neighbors to influence pre- 
diction. Hence, we call this a global network adjustment. 
An important difference between OSRR and approaches 
such as ridge regression, LASSO, and the elastic net is 
that these methods penalize the magnitude of the param- 
eter /3 norms with some function P(/3) whereas the OSRR 
approach considers a link function between the linear pre- 
dictor X/3 and the outcome Y. The link function g(X/3) 
is the second regression so that the model becomes Y = 
g(X/3), which may serve to minimize the effect of poor 
prediction. 

The robustness of the OSRR method to variable gene 
regulatory networks and applicability to smaller sample 
sizes are its attractive features. If the gene expression net- 
work estimate Bp from Yd results in a prediction Y^g that 
is uncorrected with the smaller independent dataset Y, 
then the regression parameter v g converges to 0 under 



standard linear model theory. This property is quite dis- 
tinct from the use of the LASSO model by [9] that does 
not consider larger training sets for estimating B or vari- 
able gene-gene dependencies due to the perturbation or 
biological state. The selection of X is based upon the max- 
imum correlation across all genes between the database 
prediction fogY and the current experiment Y. Overfit- 
ting may result because the OSRR is applied to the same 
data it is tuned with (i.e., X is selected based upon Y). 
OSRR largely avoids overfitting and gains simplicity by 
using a single tuning parameter X for all transcripts. The 
method was implemented in the R statistical software 
[15], and we fitted the model in equation 6 with the lme4 
R package [16]. A random intercept was used to account 
for the correlation within case-control pairs. The pro- 
grams, example simulated data, and a demonstration of 
the penalty selection are available for download from the 
supplemental website as Additional file 1. 

Simulation studies 

We performed a simulation study to illustrate the proper- 
ties of OSRR in the context of correlated covariates with 
similar effects on a continuous outcome. Here, we con- 
sider a single outcome y (n x 1) with two independent 
sets of correlated predictors X\ and X<i both with dimen- 
sion (N x G). The N = 201 rows of X\ and X2 are drawn 
from a compound-symmetric, multivariate Gaussian dis- 
tribution with dimension G = 100, mean 0 and variance 
1.0 with correlation p = 0.9. Both X\ and X2 represent 
two correlated sets of genes that have an influence on 
the expression of the gene y. Within these sets, the influ- 
ence of the genes is the same, which might reflect the 
activity of two pathways that are either associated with 
an increase or an inhibition of the expression of y. That 
is, y = X1B1 + X 2 B 2 + 6 where B x = [ft . . . ft]' and 
B2 = [ft • • • ft]' are G x 1 with identical elements ft = 1.0 
and ft = —1.0, respectively. The errors e have a Gaus- 
sian distribution with variance 1.0. The data were fit using 
3 models: ordinary least squares (OLS), Elastic Net, and 
the OSRR with X = 10, 000. The elastic net penalty was 
selected using K=5 fold cross-validation. The results are 
shown in Figure 1, which is a network representation of 
the coefficient estimates. All models perform similarly in 
terms of prediction. The estimates of £1 and £2 are highly 
variable for OLS. The elastic net predicts as well as other 
models, but the estimated coefficients are highly vari- 
able with some coefficients clustered around 0. To offset 
the Lasso parameter estimates close to 0, the remaining 
coefficients are estimated to be larger in magnitude than 
the true values. The OSRR exhibits a strong smoothing 
property that equalizes the influence that each of the cor- 
related predictors has, whereas the elastic net and the 
Lasso tend to reduce the number of influential predic- 
tors. This is an advantage in that it provides robustness, 
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Figure 1 Network depiction of regression estimation. The regression coefficients are shown as rays from the origin. The absolute value of the 
coefficient is the distance from the origin and the sign is indicated by blue (+) or red (-). There are 200 rays indicating 200 estimated coefficients, and 
the true values of the coefficients (+1,-1) appear as rings. The Ordinary Least Squares (OLS) estimates are highly variable because of the small sample 
size and the correlation between the positive and negative predictor groups. The elastic net has less variable estimates than OLS, but the OSRR 
coefficients demonstrate the the strong smoothing property of that equalizes the influence of predictors. 



but when there are sparse predictors or there is sufficient 
sample size to estimate the coefficients more precisely, 
then we anticipate that the OLS, elastic net or Lasso are 
advantageous. 

As a proof of principle, we performed another simu- 
lation study to demonstrate the advantage of the OSRR 
methodology applied to differential expression compared 
to a naive approach that does not account for prior 
knowledge about correlations between transcripts. We 
simulated a database Y D with G = 200 and N = 100. 
The transcripts were given a block diagonal correlation 
structure with 2 blocks of size 100 transcripts each with 
compound symmetry having variance 1 and correlation 
p = 0.8. Although this is a relatively simple multivariate 
Gaussian model compared to actual gene networks, it still 
represents a large dependency between genes. The OSRR 
model was fitted as described to estimate Bp with k = 
1000 selected out of the set {10 2 , 10 3 , 10 4 , 10 5 , 10 6 } based 
upon the out-sample prediction from training and valida- 
tion sets. We then simulated a smaller (N = 20) dataset 



Y with the same correlation structure. Ten of the arrays 
were under the control condition (xt = 0), and 10 were 
under a treatment condition (xi = 1). Differential expres- 
sion was simulated by randomly selecting genes with 
probability 0.2 and then adding or subtracting 8 g = 1.0 
to those genes. We compared the performance of the 
model in equation 6 to a model without correction for 
the network relationships Y^g. This naive model reduces 
to the t-test. The results are shown in Figure 2. The his- 
tograms for p-values corresponding to the null hypothesis 
(8g = 0) are given for both models, and the OSRR model 
clearly has smaller p-values for the differentially expressed 
(DE) genes compared to the naive model. Also, the naive 
model p-values do not have a uniform distribution for the 
not differentially expressed genes as the OSRR does. The 
lack of uniform distribution of the null p-values is due 
to the dependence structure and could inflate the Type I 
error. This is consistent with higher sensitivity and speci- 
ficity of OSRR relative to naive methods that do not utilize 
prior information on gene networks. 
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Figure 2 Proof-of-concept simulated differential expression experiment. Histograms of p-values are shown for both differentially expressed 
genes (DE) and not DE genes that follow the null hypothesis. The raw (naive) model has p-values for not DE genes that do not have the expected 
uniform distribution, though the p-values corresponding to the differentially expressed (DE) genes tend to be low. The OSRR model applied to the 
same dataset more clearly separates the DE from the not DE genes, and the distribution of the not DE genes is more uniform. 



Empiricial studies 

Chromosomal deletion study 

In our chromosome deletion study, we investigated the 
effects of 18q chromosomal deletions (deleted segments 
were the 60,021,550 — 76, 117, 153 bps of chromosome 
18) on cell lines derived from blood samples called lym- 
phoblastoid cell lines (LCLs). There were 10 case-control 
(c-c) pairs with the case subject having a deletion on 
chromosome 18 (18q-), and the control was a normal 
genotype, same-sex parent giving a total of N = 20 = 
2 x 10 subjects. We assayed each c-c pair with 4 Agi- 
lent 44K expression arrays. We used a dye-swap design 
such that for 2 arrays the case was labeled with Cy3 (con- 
trol labeled with Cy5) and 2 with case labeled with Cy5 
(control labeled with Cy3). Each case had associated clin- 
ical phenotypes including height, weight, age, and various 
laboratory measures of growth hormone responsiveness. 
The log-expression data for a given subject and gene are 
denoted as y grs where the indices g, r, and s represent gene, 
replicate, and spot respectively. The Cy3 and Cy5 channels 
were treated separately, and we did not use the log-ratio. 
For each subject, we averaged the replicates and the spots 
corresponding to the same gene yielding the N x 1 vector 
y g . We did not model the dye factor because it was bal- 
anced and not predictive. There is a corresponding N x 1 
covariate matrix X where %i = /(/^subject is a case). We 
also examined the p-values of the 8 g parameter from the 
model compared to a standard analysis with the LIMMA 
software [12]. 

For Yd, we used a much larger dataset from [17] who 
studied similar LCL cells that were treated with a variety 
of chemotherapies. There were a total of 374 Affymetrix 
chip assays in this dataset. We included all genes that were 
within the deleted region, but otherwise restricted our 
analysis to genes that had gene symbols in common to 
both the Affymetrix and the Agilent platforms and were 
expressed in the Affymetrix data. This leaves a total of 



5, 035 genes to estimate Bp. The expression values of the 
database genes were centered, but not scaled. A key issue 
is how to select the penalty parameter k, and the selec- 
tion process can be extremely computationally intensive 
as [9] used 200 CPU days for their cross-validation Lasso 
approach. To greatly simplify the selection of the penalty 
and minimize the tuning and hypothesis testing on the 
same dataset, we chose one penalty parameter for all genes 
and took a split sample approach so that the database 
was divided into two sets of 187 arrays. Further, we took 
into account three aspects of goodness-of-fit in assessing 
cross-validation. First, we considered prediction among 
the split samples and second, we considered the stability 
of the regression parameters Bp between the split sam- 
ples. Third, we examined the prediction of Bp with respect 
to our independent experiment. The computation time 
for selecting k took approximately 151,445 CPU seconds 
(8GB RAM, 2 Ghz) or about a 100 fold decrease in com- 
puting time relative to the Lasso approach for a similar 
sized dataset and computing environment [9]. The Based 
upon these three criteria, we selected k = 10 4 out of 
the set {10 2 , 10 3 , 10 4 , 10 5 , 10 6 }, and we applied the corre- 
sponding Bd to construct predictors Y^g = &DgY for y g 
from our expression experiment. Lastly, to further assess 
the robustness and Type I error, we performed a per- 
mutation analysis comparing OSRR and a conventional 
method. We permuted the case-control labels 100 times, 
and compared the histograms of the p-values from both 
methods. 

Results 

We observed that the network predictions Y^ g based 
upon Bd estimated from the database Yd were strongly 
correlated with new data y g . In Figure 3, the median 
correlation of Y Ng with y g is 0.54 so that 25% « V0.54 
of the variance of transcription regulation is explained. 
Note that the correlation of the observed expression with 
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Figure 3 Histogram of correlations of the predictor Y^ g based upon regulatory network from database Yd with gene expression in an 
independent experiment V. The positivity of the center of the distribution implies a dependency of gene expression that is consistent with the 
OSRR model prediction. 



the network predictor is expected to be > 0, and the 
mean correlation is significantly greater than 0 (p < 
0.0001). The degree of prediction is remarkable because 
Cosgrove et al [9] originally studied bacteria, whereas 
these data indicate that more complex human transcrip- 
tional patterns even on different microarray platforms are 
largely predictable. Nevertheless, for some genes, there 
is a negative correlation consistent with random noise. 
This underscores the importance of the estimation of v g 
in equation 6. That is, if the network model does not fit 
the smaller dataset, then v g is estimated to be close to 0 so 
that its effects are mitigated. 

Figure 4, shows the distribution of p-values for dif- 
ferential expression of transcripts under both the stan- 
dard LIMMA analysis and OSRR models. Under the null 
hypothesis p-values are uniformly distributed between 0 
and 1. The p-values of the LIMMA model do not appear 
to be for from uniformly distributed, but the OSRR model 
not only has more significant (p < 0.05) values, it also 
has a clear "spike" of small p-values towards 0. It is possi- 
ble to model these types of distributions with a beta and 
uniform mixture (BUM) as suggested by Pounds et al [18] 



where the beta component models the spike of p-values 
that corresponds to differentially expressed transcripts 
and the uniform component corresponds to the tran- 
scripts that follow the null hypotheses. The BUM fit is 
shown in Figure 4, and the ratio of the the uniform 
(blue) component over the mixture density (green) is the 
empirical Bayes probability estimate of the false posi- 
tive rate. The comparison of numbers of genes that at 
given p-values cutoffs is shown in Table 1. Clearly, the 
OSRR model identifies more than double (851 vs. 337) 
the number of significant transcripts, and those tran- 
scripts selected by the LIMMA approach are also selected 
by OSRR at a rate of 95% (319/337). The correlation 
between the — log(p-values) from LIMMA and OSRR is 
0.88, which indicates fairly good agreement in ranking the 
genes by probability of differential expression. The mean 
value for the false discovery rate (FDR) of the 21 genes 
in the deleted region in the chromosome according to 
LIMMA was 0.51 compared to the mean FDR of the q- 
value adjusted OSRR p-values of 0.17, indicating that the 
OSRR model is more likely to identify the deleted genes as 
differentially expressed. 



Limma P-Value OSRR P-Value 

oo 

CD 

CD 

Q 

CM 
O 

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

P Values P values 

Figure 4 Histogram of p-values generated by conventional LIMMA and OSRR methods applied to the chromosomal deletion study. Note 
that LIMMA produces p-values that differ only slightly from a uniform distribution, but the OSRR method gives more significant p-values. The 
beta-uniform mixture model is shown in green and the uniform component is blue. The ratio of blue/green is the empirical Bayes probability of the 
false positive rate at a given p-value. This ratio approaches 0.5 for conventional analysis and it approaches 0.25 for the OSRR method. 
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Table 1 Number of significant transcripts in standard 
LI MM A and OSRR models at different p-value cutoffs 



Up/Down 


P-Value 


LIMMA 


OSRR 


Both 


Down 


0.0500 


94 


209 


92 




0.0100 


25 


79 


25 




0.0010 


9 


19 


9 




0.0001 


5 


8 


5 


Up 


0.0500 


241 


642 


227 




0.0100 


46 


251 


42 




0.0010 


6 


52 


5 




0.0001 


2 


15 


2 



Up/Down corresponds to higher/lower expression in the cases relative to 
controls. 



Not only was the OSRR method shown to identify twice 
the number of differentially expressed genes, but the log 
fold-change estimates are more stable, and therefore less 
likely to be false positives based upon the split-sample 
analysis. We randomly split the data set 20 times into two 
sub-datasets of 5 families each, and compared the results 
of OSRR and the t-test from the two sub-datasets of the 
partition in estimating the log-fold change between cases 
and controls. Note that LIMMA is equivalent to the t-test 
in estimating log-fold change. Table 2 lists the correla- 
tions between the in-sample and out-sample comparisons 
of the differential expression estimates of the conven- 
tional model <^ td and the OSRR model 8 g . The OSRR 
estimate was much more reproducible than the conven- 
tional estimate in independent samples with correlations 
of 0.321 (OSRR) vs. 0.201 (Standard) (95% CI for dif- 
ference [0.054,0.183], p < 0.001). Furthermore, the 
out-sample results clearly show that OSRR is more con- 
sistent with both OSRR and conventional estimates in 
independent split datasets. 

We compared the performance of the elastic net 
predictions with OSRR using a subset of 500 genes 
because of the computational time of the elastic net in 
this context. We selected a gene-specific penalty as did 
Cosgrove et al [9] who used the Lasso. We compared the 
average correlation of the predictions with the expression 



Table 2 Reliability assessed by a repeated split-sample 
analysis 



In/Out Sample 


Type 


Mean correlation 


Std Dev 


In-Sample 


8 g vs8f d 


0.917 


0.034 


Out-Sample 


8* d vs Sf d 


0.203 


0.115 


Out-Sample 


8 g vs 8 g 


0.321 


0.073 


Out-Sample 


8g VS ^ 


0.313 


0.089 



The parameters 8 g and 8g td denote the differential expression estimates for 
OSRR and standard methods. 



of the 500 genes in our smaller dataset, and we found 
that the mean correlation was 0.01 95% CI [-0.01,0.03] 
(paired t-test: p = 0.43) higher for the OSRR model. This 
is consistent with OSRR allowing equal prediction at a 
fraction of the computational cost of the elastic net. 

The permutation study results are shown in Figure 5. 
The resulting estimates of the null distributions are 
different for the conventional and the OSRR model. The 
LIMMA model has a conservative Type I error rates indi- 
cated by nonuniform p-value distributions with a dip near 
0, and the OSRR model has a slightly inflated Type I 
error with more smaller p-values than uniform. The OSRR 
model p-values are quite different from the non-permuted 
data in Figure 4. In contrast, the conventional analysis 
null distribution is more similar to the non-permuted 
distribution of p-values, which suggests a low signal-to- 
noise ratio. This is consistent with the OSRR model hav- 
ing a reasonably controlled Type I error rate, but much 
greater power to detect differences between the cases and 
controls. 

We also applied the interaction model in Equation 7. 
A substantial proportion (10%, 484/5035) of genes had 
interactions (Ho : y g = 0 with p < 0.01) between the 
dependence on the gene network and case-control status. 
This implies that the regulatory relationships are different 
for particular genes between cases and controls. The gene 
with the most significant interaction effect was the GARS 
gene, which has been linked to severe neurological con- 
ditions [19]. Interestingly, 93% of the 100 most significant 
interactions were negative (y g < 0), which implies that 
the expression predictions are substantially less positively 
correlated with the observed expression in the abnormal 
cases than controls. See Additional file 2. This is consis- 
tent with overall disruption of the regulatory networks in 
the cases. Such observations would be impossible using 
standard differential expression models that only examine 
mean differences. 

For the purpose of comparison, we applied the method 
of Cosgrove et al [9] to this dataset without the use of 
the larger training dataset using the elastic net estima- 
tors. Their method was not intended to be applied to 
datasets of this size (2 conditions (case/control) and 10 
paired samples), but we wanted to investigate the neces- 
sity of using the database in a real example. Despite 5-fold 
cross-validation for the selection of the tuning parame- 
ters, there was overfitting of the gene network so that the 
estimates BY from the fit had a median correlation with 
Y of > 0.95 and the analysis of the residuals suggested 
in Equation 5 yielded p-values that did not differ from a 
uniform distribution (Data not shown). No differentially 
expressed gene could be identified due to the confounding 
of the gene expression network BY and the case-control 
status. However, we emphasize that their method was not 
intended for small datasets. 
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Permuted: Limma P-Value Permuted: OSRR P-Value 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

P Values P Values 

Figure 5 Histogram of the p-values of the from the permuted chromosome deletion study. The conventional analysis method has a slightly 
conservative, non-uniform null distribution. Whereas, the OSRR method has slightly anti-conservative, non-uniform null distribution. 



Circadian rhythm study 

As a proof of principle, we considered another dataset to 
see if the OSRR model had similar advantages in a mouse 
model. This dataset [20] consisted of a time-series of brain 
tissue collected in mice with two different genotypes (wild 
type and a mutation of the circadian rhythm "clock" gene). 
The brain tissue samples were measured on Affymetrix 
tissue arrays in duplicate every two hours for 44 and 24 
hours for a total of 24 and 14 arrays, respectively, for 
wild type and mutant mice. We estimated the B matrix 
using a database Yd from an experiment with the same 
Affymetrix platform with 122 samples of mixed tissue 
types [21]. We assessed the 3,000 genes with the high- 
est variance in the circadian dataset. We fit a sinusiodal 
model with and without network adjustment term Y^ g v g 
to identify genes associated with the circadian rhythm 

y g = Y^gVg + ag cos(T + b g ) + d + 6 (10) 

where T (1 x N) is the time covariate in radians, d is 
an intercept, and the time dependencies are tested with 
scalars a g and b g (Ho : a g = b g = 0) using a log-likelihood 
test assuming that e is Gaussian noise. We did not use 



the information about the genotype of the mice within the 
models. Rather, we tested whether or not the OSRR model 
could account for the biological variation due to different 
genetic backgrounds and still identify time dependent 
genes. The results are shown in Figures 6 and 7. Figure 6 
shows the unpermuted data analysis in both models with 
the OSRR model having a higher proportion of signifi- 
cant genes. Figure 7 shows the null distributions under 
permutations, and the null distributions of the two mod- 
els are very similar. This indicates that the OSRR model 
has more power to detect time-dependent genes with a 
similar Type I error rate as the conventional model. We 
also compared the performance of the elastic net and the 
OSRR gene expression predictions by the average corre- 
lations of the network predictors Y^ g with the outsample 
gene expression y g for a random subset of 200 genes. We 
found that the OSRR method had a 0.1 higher average 
correlation (95% CI [0.05,0.15] with a paired t-test: p = 
0.0003). That is, the OSRR had better prediction that the 
elastic net model for less computational costs; however, 
this advantage in prediction may be due to the smaller 
sample size of the training set (N = 122) relative to the 
chromosomal deletion training set (N = 374) for which 
the prediction of the two methods was approximately 
equivalent. 
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Figure 6 Histogram of p-values generated by conventional raw and OSRR method of the circadian rhythm study. The beta-uniform mixture 
model is shown in green and the uniform component is blue. The OSRR model identifies approximately double the number of significant genes. 
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Discussion 

We have utilized the estimates of gene-networks from 
large databases in the analysis of small, independent 
datasets assayed on a different microarray type to suc- 
cessfully predict about 25% of the variation in transcript 
expression. The prediction is shown to markedly increase 
the sensitivity and the reliability of detecting differentially 
expressed transcripts in two different datasets. These 
inferences are different from the standard differential 
expression analyses because they reflect an adjustment 
based upon a regulatory model for each gene, and by test- 
ing an interaction term, one may make a statistical assess- 
ment of whether the regulatory network has changed 
between biological states. This method is approximately 
100 times computationally faster than the previously 
reported method [9]. The majority of the computational 
time is spent fitting predictive models for each gene inde- 
pendently, and this time is multiplied by the density of 
the grid of the tuning parameter(s). OSRR has the advan- 
tages that the tuning parameter grid can be sparse and 
that the algorithms for fitting linear models are relatively 
well-optimized. The computation can be accelerated by 
parallelizing the fit, but parallel analyses are nontrivial 
because of large memory requirements of the operations 
(i.e., solving linear equations with 5,000 variables). Atten- 
tion should also be paid to improving the computational 
efficiency of more complex models such as the elastic net. 

Unlike the previous model, the use of the network 
predictor as a covariate in the differential expression 
regression model also provides robustness against poor 
prediction of particular transcripts. Given OSRRs ease 
of implementation and its robustness, there is a broad 
set of potential applications to small sample size expres- 
sion experiments that leverage the growing large-scale 
gene expression databases such as the Gene Expression 
Omnibus (GEO) [22]. We recommend that researchers 
select relevant datasets to construct network models 
by considering the species, type of tissue or cells, the 
microarray platform, and other sources of variation. We 



have seen that if the database Yd used to estimate Bd has a 
different tissue type than the independent dataset Y, then 
the method tends to be less effective in accounting for 
variation. 

Conclusion 

We have developed a novel adaptation of ridge-regression 
called OSRR that robustly estimates models of transcrip- 
tional co-expression networks based upon large microar- 
ray experiments. There are many possibilities for future 
research. Because of the robustness across microarray 
platforms, the OSRR approach can applied to RNA 
sequencing (RNA-Seq) data as well. In preliminary stud- 
ies, we have found that the predictions of OSRR derived 
from microarray data perform similarly when applied to 
RNA-seq data. As we suggested with Equation 7, we 
can consider tissue specific modulations of regulatory 
networks as an extension of the OSRR model. Different 
tissue types or experimental conditions may induce differ- 
ent correlations between genes, and this fact is utilized in 
the COXEN model in [23] for disease classification pur- 
poses. These authors found that genes have different cor- 
relations in different cell types, and that genes that have 
shared correlation structures between two subtypes can 
predict how those subtypes will respond to chemotherapy. 
Also, the steps in the two-step fit are suggestive of levels 
within a hierarchical model. We can use this framework 
to extend the prediction model by including extra terms. 
For example, some genes may be better predicted using 
another network model with prediction Yn2#- To account 
for this, we propose 

y g = Y Ng v Ng + Y mg v N2g +X8 g + 6. (11) 

If the other network model is a superior predictor, then 
this will be reflected by the vn2# parameter. However, the 
more terms within the model, the more biological repli- 
cates are required for model stability. We may include 
information a known network of specific genes in a sim- 
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ilar manner. We could further address another aspect of 
the SSEM problem. That is, the structure among the rows 
of B was not modeled. For example, if a gene had a rela- 
tively large influence on many other genes, then one could 
use this pattern for better estimation of B. This may poten- 
tially be achieved using a similar approach of Friedman 
et al [24] in the construction of the graphical Lasso. 

Additional files 



Additional file 1 : File with link to R implementation programs. 

Additional file 2: File with estimation results from chromosome 1 8 
study for alterations in gene networks in cases relative to controls. 
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